Integrated Analysis: Lost Wallets, PISA Scores, and Social Capital Measures#

This notebook combines exploratory data analysis (EDA) of the Lost Wallets dataset with an investigation into the relationship between wallet reporting rates, PISA educational achievement scores, and other survey-based social capital measures.

The analysis builds on the work of Dave Tannenbaum (Lost Wallets Social Capital Project) and explores its connection to international education data (PISA).

Initial Findings and Context (from wallet_pisa.ipynb)#

Tannenbaum’s paper shows that:

  • Wallet report rates correlate with survey measures of social capital, indicating that the survey data carries useful information.

  • Wallet report rates serve as a predictor in models of “Economic and Institutional Performance”, suggesting that social capital as measured by wallet report rates can “explain variation in economic development”.

This notebook investigates how education data (PISA scores) relates to these findings. PISA 2022 reading scores are the primary source (OECD PISA 2022 Results Vol. I, pp. 54-55). The 2022 data is missing observations for several countries in the wallet dataset (China, Ghana, India, Kenya, Russia, South Africa). The 2018 PISA data does include China and Russia; China is an outlier, pairing high education scores with low wallet reporting rates, and Tannenbaum’s paper also flags it as a special case. East Asian countries are generally underrepresented in the wallet dataset.

1. Setup and Data Loading#

First, we import necessary libraries and load the primary dataset.

import pandas as pd
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
from scipy.stats import chi2_contingency
import numpy as np
from pandas.api.types import is_numeric_dtype
import statsmodels.api as sm
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# pd.set_option('future.no_silent_downcasting', True)
pd.options.mode.chained_assignment = None # default='warn', suppress warning for now
# Define categorical and numerical features for clarity and later use
numeric_features_initial = [
    "general_trust",
    "GPS_trust",
    "general_morality",
    "MFQ_genmorality",
    "civic_cooperation",
    "GPS_posrecip",
    "GPS_altruism",
    "stranger1",
]

cat_cols = [
    "country",
    "response", # Target variable, will be handled separately for modeling
    "male",
    "above40",
    "computer",
    "coworkers",
    "other_bystanders",
    "institution",
    "cond",
    "security_cam",
    "security_guard",
    "local_recipient",
    "no_english",
    "understood_situation",
]

sc_measures = [
    "log_gdp",
    "log_tfp",
    "gee",
    "letter_grading",
]

# Load the dataset
# Assuming the data is in a subdirectory 'lost-wallets-social-capital/data/' relative to the notebook
# If not, adjust the path accordingly.
try:
    df = pd.read_csv(
        "../../lost-wallets-social-capital/data/data_for_ml.csv",
        dtype={col: "category" for col in cat_cols if col != 'response'} # Keep response as object for now, convert later
    )
    df['response'] = df['response'].astype('category') # Now convert response
except FileNotFoundError:
    print("Dataset 'data_for_ml.csv' not found. Please ensure the path is correct.")
    print("You might need to download it from: https://github.com/davetannenbaum/lost-wallets-social-capital")
    df = pd.DataFrame() # Create an empty dataframe to avoid further errors if file not found

if not df.empty:
    print("Dataset loaded successfully.")
    df.head()
Dataset loaded successfully.

2. Exploratory Data Analysis (EDA)#

Visualizing the distributions of individual features and their relationships.

2.1. Feature Distributions#

Histograms for continuous variables and bar plots for categorical variables.

def subplot_subset(data_frame, cols_to_plot, n_subplot_cols=4):
    """Make a grid of histograms or bar plots for specified columns."""
    if data_frame.empty or not cols_to_plot:
        print("Dataframe is empty or no columns to plot.")
        return
        
    n_features = len(cols_to_plot)
    n_rows = (n_features + n_subplot_cols - 1) // n_subplot_cols # Calculate number of rows needed
    
    fig = make_subplots(rows=n_rows, cols=n_subplot_cols, subplot_titles=cols_to_plot)
    
    for i, feature in enumerate(cols_to_plot):
        row_idx = (i // n_subplot_cols) + 1
        col_idx = (i % n_subplot_cols) + 1
        
        if pd.api.types.is_numeric_dtype(data_frame[feature]):
            fig.add_trace(
                go.Histogram(x=data_frame[feature], name=feature),
                row=row_idx,
                col=col_idx,
            )
        else: # Categorical
             # For bar plots, we need counts of each category
            counts = data_frame[feature].value_counts()
            fig.add_trace(
                go.Bar(x=counts.index, y=counts.values, name=feature),
                row=row_idx,
                col=col_idx,
            )
            
    fig.update_layout(title_text="Feature Distributions", height=300*n_rows, showlegend=False)
    fig.show()

if not df.empty:
    # Visualize first set of features (example)
    # The original script plots columns by index, which can be fragile.
    # It's better to explicitly list features or groups of features.
    print("Displaying distributions for numeric features:")
    subplot_subset(df, numeric_features_initial)
    
    # Example for some categorical features (excluding country and response for brevity here)
    categorical_subset_viz = [col for col in cat_cols if col not in ['country', 'response']][:8] # First 8 for example
    print("\nDisplaying distributions for a subset of categorical features:")
    subplot_subset(df, categorical_subset_viz)
    
    print("\nDisplaying distributions for social capital measures:")
    subplot_subset(df, sc_measures)
Displaying distributions for numeric features:
Displaying distributions for a subset of categorical features:
Displaying distributions for social capital measures:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

def subplot_subset_seaborn(data_frame, cols_to_plot, n_subplot_cols=4):
    """Make a grid of histograms or bar plots for specified columns using Seaborn."""
    if data_frame.empty or not cols_to_plot:
        print("Dataframe is empty or no columns to plot.")
        return

    n_features = len(cols_to_plot)
    n_rows = (n_features + n_subplot_cols - 1) // n_subplot_cols

    fig, axes = plt.subplots(n_rows, n_subplot_cols, figsize=(5 * n_subplot_cols, 4 * n_rows))
    axes = axes.flatten()

    for i, feature in enumerate(cols_to_plot):
        ax = axes[i]
        if pd.api.types.is_numeric_dtype(data_frame[feature]):
            sns.histplot(data=data_frame, x=feature, ax=ax, kde=False)
        else:
            sns.countplot(data=data_frame, x=feature, ax=ax,
                          order=data_frame[feature].value_counts().index)
            ax.tick_params(axis='x', rotation=45)

        ax.set_title(feature)

    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])

    fig.suptitle("Feature Distributions", fontsize=16)
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()

# Example usage
if not df.empty:
    print("Displaying distributions for numeric features:")
    subplot_subset_seaborn(df, numeric_features_initial)

    categorical_subset_viz = [col for col in cat_cols if col not in ['country', 'response']][:8]
    print("\nDisplaying distributions for a subset of categorical features:")
    subplot_subset_seaborn(df, categorical_subset_viz)

    print("\nDisplaying distributions for social capital measures:")
    subplot_subset_seaborn(df, sc_measures)
Displaying distributions for numeric features:
[Figure: histograms of the numeric survey features]
Displaying distributions for a subset of categorical features:
[Figure: bar charts for the selected categorical features]
Displaying distributions for social capital measures:
[Figure: distributions of the social capital measures]

2.2. Generalized Association Matrix#

To understand pairwise relationships, we compute a generalized association matrix: Pearson’s correlation for numeric–numeric pairs, the correlation ratio for numeric–categorical pairs, and Cramér’s V for categorical–categorical pairs.
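For reference, the two non-Pearson measures implemented below have the standard definitions

$$V = \sqrt{\frac{\chi^2 / n}{\min(k-1,\, r-1)}}, \qquad \eta = \sqrt{\frac{\sum_c n_c\,(\bar{y}_c - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}}$$

where $\chi^2$ is the chi-squared statistic of the $r \times k$ contingency table with $n$ observations, and $n_c$, $\bar{y}_c$ are the size and mean of the numeric variable within category $c$. Both statistics lie in $[0, 1]$, so, unlike Pearson’s correlation, they carry no sign.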

def cramers_v(ex, y_cramer):
    """Cramer correlation between two categorical variables."""
    # Ensure no NaN values, as crosstab will not handle them well for chi2
    ex_nona = ex.dropna()
    y_cramer_nona = y_cramer.dropna()
    # Align indices if they are different after dropping NaNs
    common_index = ex_nona.index.intersection(y_cramer_nona.index)
    ex_nona = ex_nona.loc[common_index]
    y_cramer_nona = y_cramer_nona.loc[common_index]
    
    if ex_nona.empty or y_cramer_nona.empty:
        return 0.0 # Or np.nan if preferred for empty inputs
        
    confusion_matrix = pd.crosstab(ex_nona, y_cramer_nona)
    if confusion_matrix.empty or confusion_matrix.shape[0] == 0 or confusion_matrix.shape[1] == 0:
        return 0.0
        
    chi2_cramers = chi2_contingency(confusion_matrix, correction=False)[0]
    n = confusion_matrix.sum().sum()
    if n == 0:
        return 0.0
        
    phi2 = chi2_cramers / n
    r, k = confusion_matrix.shape
    if min(k - 1, r - 1) == 0:
        return 0.0 # Avoid division by zero if a variable has only one category after filtering
    return np.sqrt(phi2 / min(k - 1, r - 1))

def correlation_ratio(values, categories):
    """Correlation ratio between a categorical and numeric variable."""
    na_mask = values.notna() & categories.notna()
    num_var = values[na_mask]
    cat_var = categories[na_mask]

    if num_var.empty or cat_var.empty or len(cat_var.unique()) < 2:
        return 0.0

    fcat, _ = pd.factorize(cat_var)
    cat_num = np.max(fcat) + 1
    y_avg = num_var.mean()

    numerator = sum(
        len(num_var[fcat == i]) * (num_var[fcat == i].mean() - y_avg) ** 2
        for i in range(cat_num) if len(num_var[fcat == i]) > 0
    )
    denominator = np.sum((num_var - y_avg) ** 2)

    return np.sqrt(numerator / denominator) if denominator != 0 else 0.0

# Compute the generalized association matrix and plot it
if not df.empty:
    df_cols_assoc = df.columns.drop("country", errors="ignore")
    assoc = pd.DataFrame(index=df_cols_assoc, columns=df_cols_assoc, dtype=float)

    for col1 in df_cols_assoc:
        for col2 in df_cols_assoc:
            if col1 == col2:
                assoc.loc[col1, col2] = 1.0
            elif is_numeric_dtype(df[col1]) and is_numeric_dtype(df[col2]):
                corr_val = df[[col1, col2]].corr().iloc[0, 1]
                assoc.loc[col1, col2] = corr_val if pd.notna(corr_val) else 0.0
            elif is_numeric_dtype(df[col1]) and not is_numeric_dtype(df[col2]):
                assoc.loc[col1, col2] = correlation_ratio(df[col1], df[col2])
            elif is_numeric_dtype(df[col2]) and not is_numeric_dtype(df[col1]):
                assoc.loc[col1, col2] = correlation_ratio(df[col2], df[col1])
            else:
                assoc.loc[col1, col2] = cramers_v(df[col1], df[col2])

    assoc = assoc.astype(float)

    # Plot using Seaborn
    plt.figure(figsize=(12, 10))
    sns.heatmap(
        assoc,
        cmap="RdBu_r",
        center=0,
        annot=False,
        fmt=".2f",
        square=True,
        linewidths=0.5,
        cbar_kws={"label": "Association"},
    )
    plt.title("Association Matrix (Generalized Correlation)", fontsize=16)
    plt.xticks(rotation=45, ha="right")
    plt.yticks(rotation=0)
    plt.tight_layout()
    plt.show()

else:
    print("DataFrame is empty, skipping association matrix.")

[Figure: generalized association matrix heatmap]

2.3. Pairplot for Numeric Features and Response#

Visualizing relationships between numeric features, colored by the ‘response’ variable.

if not df.empty:
    # Ensure 'response' is suitable for hue (categorical)
    # If response is numeric (e.g. 0, 100), convert to string for distinct categories in pairplot hue
    df_pairplot = df.copy()
    if pd.api.types.is_numeric_dtype(df_pairplot['response']):
         df_pairplot['response'] = df_pairplot['response'].astype(str)
    
    print("Displaying pairplot for numeric features, colored by response category.")
    # Adding a check for sufficient data points to avoid errors with pairplot
    if len(df_pairplot) > 1 and len(numeric_features_initial) > 0:
        pairplot_fig = sns.pairplot(df_pairplot[numeric_features_initial + ["response"]], hue="response", diag_kind='kde')
        pairplot_fig.fig.suptitle("Pairplot of Numeric Features by Wallet Response", y=1.02)
        # plt.show() # Use this if running in a local environment where seaborn doesn't auto-show
    else:
        print("Not enough data or numeric features to generate pairplot.")
else:
    print("DataFrame is empty, skipping pairplot.")
Displaying pairplot for numeric features, colored by response category.
[Figure: pairplot of numeric features colored by wallet response]

3. PISA Education Data Integration and Analysis#

Loading PISA reading scores and merging them with the wallet dataset to explore correlations.

3.1. Loading PISA Data#

We load PISA reading scores for 2022 and 2018. The 2022 data is primary; the 2018 data can backfill countries missing from the 2022 set (though the original script focuses on 2022).
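As a sketch of that fallback, the 2018 scores could backfill gaps in the 2022 table roughly like this. The frames below are toy stand-ins (the scores are made up), with the same column names as the tables loaded in this notebook:

```python
# Sketch: prefer 2022 PISA scores, fall back to 2018 where 2022 is missing.
import pandas as pd

pisa_2022 = pd.DataFrame({"country": ["Mexico", "Poland"],
                          "PISA_Avg_Reading_2022": [415.0, 489.0]})
pisa_2018 = pd.DataFrame({"country": ["Mexico", "Poland", "China", "Russia"],
                          "PISA_Avg_Reading_2018": [420.0, 512.0, 555.0, 479.0]})

merged = pisa_2022.merge(pisa_2018, how="outer", on="country")
# Take the 2022 score where available, otherwise the 2018 score
merged["PISA_Avg_Reading"] = merged["PISA_Avg_Reading_2022"].fillna(
    merged["PISA_Avg_Reading_2018"])
print(merged[["country", "PISA_Avg_Reading"]])
```

Countries only present in 2018 (e.g. China here) then carry their 2018 score, which is worth flagging in any downstream plot.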

# PISA education data: https://nces.ed.gov/surveys/pisa/idepisa/report.aspx
# The original script uses local Excel files. For reproducibility, it's better to point to URLs if available,
# or ensure the files are in a known relative path (e.g., 'data/' subdirectory).

try:
    pisa_reading_2022 = (
        pd.read_excel(
            "../../data/PISA_reading_2022.xlsx", # Assumes file is in 'data' subdirectory
            skiprows=list(range(0, 11)) + list(range(112, 117)),
            usecols=["Jurisdiction", "Average", "Standard Error"],
            na_values=["†", "—"],
        )
        .dropna(subset=["Average"])
        .rename(columns={"Jurisdiction": "country", "Average": "PISA_Avg_Reading_2022"})
        .replace(
            {
                "Russian Federation": "Russia",
                "United Arab Emirates": "UAE",
                "United Kingdom": "UK",
                "United States": "USA",
            }
        )
        .astype({"PISA_Avg_Reading_2022": float})
    )
    print("PISA 2022 reading data loaded successfully.")
    # print(pisa_reading_2022.head())
except FileNotFoundError:
    print("PISA_reading_2022.xlsx not found in 'data/' subdirectory. Please ensure the file exists.")
    pisa_reading_2022 = pd.DataFrame() 

try:
    pisa_reading_2018 = (
        pd.read_excel(
            "../../data/PISA_reading_2018.xlsx", # Assumes file is in 'data' subdirectory
            skiprows=list(range(0, 11)) + list(range(112, 117)),
            usecols=["Jurisdiction", "Average", "Standard Error"],
            na_values=["†", "—"],
        )
        .dropna(subset=["Average"])
        .rename(columns={"Jurisdiction": "country", "Average": "PISA_Avg_Reading_2018"})
        .replace(
            {
                "B-S-J-Z (China)": "China",
                "Russian Federation": "Russia",
                "United Arab Emirates": "UAE",
                "United Kingdom": "UK",
                "United States": "USA",
            }
        )
        .astype({"PISA_Avg_Reading_2018": float})
    )
    print("PISA 2018 reading data loaded successfully.")
    # print(pisa_reading_2018.head())
except FileNotFoundError:
    print("PISA_reading_2018.xlsx not found in 'data/' subdirectory. Please ensure the file exists.")
    pisa_reading_2018 = pd.DataFrame()
PISA 2022 reading data loaded successfully.
PISA 2018 reading data loaded successfully.

3.2. Merging PISA Data with Wallet Data#

We calculate wallet reporting rates (proportion of ‘100’ responses) per country and merge this with the PISA data.
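The per-country rate can also be computed more directly than with `value_counts`/`unstack`; a compact alternative, assuming the response categories are the strings '0'/'100' as in this dataset (the toy frame below is made up):

```python
# Compact alternative: proportion of '100' responses per country.
import pandas as pd

toy = pd.DataFrame({
    "country":  ["A", "A", "A", "B", "B"],
    "response": ["100", "0", "100", "0", "0"],
})

# Mean of a boolean series == proportion of True, i.e. the return rate
rates = (toy["response"].eq("100")
         .groupby(toy["country"]).mean()
         .rename("wallet_return_rate")
         .reset_index())
print(rates)
```

This sidesteps the string-vs-numeric column check needed after `unstack`, at the cost of hard-coding the '100' label.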

if not df.empty and not pisa_reading_2022.empty:
    # Calculate wallet reporting rates (proportion of '100' responses)
    # Ensure 'response' is treated appropriately if it's not already string '0' or '100'
    # The original data seems to use '0' and '100' as strings/categories.
    wallet_rates = (
        df.groupby("country", observed=True)["response"]
        .value_counts(normalize=True)
        .rename('proportion') # Rename before unstacking
        .unstack(fill_value=0) # Fill with 0 if a category ('0' or '100') is missing for a country
    )
    
    # Select the proportion for '100' responses. Handle if '100' column doesn't exist for some reason.
    if '100' in wallet_rates.columns:
        wallet_rates = wallet_rates[['100']].reset_index().rename(columns={"100": "wallet_return_rate"})
    elif 100 in wallet_rates.columns: # If it was numeric
        wallet_rates = wallet_rates[[100]].reset_index().rename(columns={100: "wallet_return_rate"})
    else:
        print("Column '100' for response not found in wallet_rates. Check response values.")
        # Create an empty df to avoid errors, or handle as appropriate
        wallet_rates = pd.DataFrame(columns=['country', 'wallet_return_rate'])

    # Merge with PISA 2022 data
    reading_wallets = pisa_reading_2022.merge(wallet_rates, how="inner", on="country")
    # reading_wallets = reading_wallets.set_index("country") # Set index later if needed for specific plots/analyses

    print("Wallet rates merged with PISA 2022 data:")
    print(reading_wallets.head())

    # Correlation between PISA scores and wallet return rate
    if 'PISA_Avg_Reading_2022' in reading_wallets.columns and 'wallet_return_rate' in reading_wallets.columns:
        correlation_pisa_wallet = reading_wallets[["PISA_Avg_Reading_2022", "wallet_return_rate"]].corr()
        print("\nCorrelation between PISA 2022 Reading Scores and Wallet Return Rate:")
        print(correlation_pisa_wallet)

        # Scatter plot
        # fig_scatter_pisa_wallet = px.scatter(
        #     reading_wallets.reset_index(), 
        #     x="wallet_return_rate", 
        #     y="PISA_Avg_Reading_2022", 
        #     hover_data=["country"],
        #     title="PISA 2022 Reading Score vs. Wallet Return Rate by Country",
        #     trendline="ols" # Add a regression line
        # )
        # fig_scatter_pisa_wallet.show()
    else:
        print("Required columns for PISA vs Wallet correlation are missing.")
else:
    print("Skipping PISA data merging and analysis as main df or PISA df is empty.")
    reading_wallets = pd.DataFrame() # Ensure it exists for later cells
Wallet rates merged with PISA 2022 data:
          country  PISA_Avg_Reading_2022  Standard Error  wallet_return_rate
0       Australia             498.050940        2.013591            0.606516
1          Canada             507.132896        1.965629            0.550000
2           Chile             447.978957        2.632560            0.361111
3  Czech Republic             488.603669        2.246224            0.700000
4         Denmark             488.800991        2.578275            0.750000

Correlation between PISA 2022 Reading Scores and Wallet Return Rate:
                       PISA_Avg_Reading_2022  wallet_return_rate
PISA_Avg_Reading_2022                1.00000             0.78018
wallet_return_rate                   0.78018             1.00000

3.3. Commentary on PISA, Wallets, and Survey Measures (from wallet_pisa.ipynb)#

Tannenbaum Point 1 Context: The Tannenbaum paper aimed to validate survey measures of social capital using wallet report rates. Similarly, PISA education measures could also be seen as validating these survey measures. It’s interesting that the relationship between PISA scores and survey measures is very similar to that between wallet report rates and survey measures.

A key observation is the strong correlation between PISA reading scores and wallet reporting rates (Pearson r ≈ 0.78 with the 2022 data), which exceeds the wallet rate’s correlation with many of the other “Economic and Institutional Performance” measures. This suggests a close link between societal honesty (as measured by wallet returns) and educational outcomes.

Takeaways/Thoughts:

  • Both honesty and education are challenging to influence directly through policy. The strong correlation might suggest that familial and community factors (components of social capital) are highly influential for educational outcomes.

  • Wallet report rates are a direct behavioral measure of social capital. While PISA scores are measures of institutional (educational system) performance, their strong correlation with wallet rates might imply they also reflect aspects of broader social capital.

3.4. Correlation Matrix including PISA, Wallet Response, Survey Measures, and SC Measures#

# if not df.empty and not pisa_reading_2022.empty:
#     # Prepare data for comprehensive correlation matrix
#     # Convert 'response' to numeric for correlation. In this dataset the
#     # categories are the strings '0' and '100', so direct conversion works;
#     # other encodings (e.g. 'returned'/'not_returned') would need an explicit map.
#     df_corr_prep = df.copy()
#     if df_corr_prep['response'].dtype == 'category' or df_corr_prep['response'].dtype == 'object':
#         df_corr_prep['response_numeric'] = pd.to_numeric(df_corr_prep['response'], errors='coerce')
#     else:
#         df_corr_prep['response_numeric'] = df_corr_prep['response']
    
#     df_merged_for_corr = df_corr_prep.merge(pisa_reading_2022[['country', 'PISA_Avg_Reading_2022']], how='left', on='country')
    
#     cols_for_country_avg_corr = ['response_numeric', 'PISA_Avg_Reading_2022'] + numeric_features_initial + sc_measures
    
#     # Calculate country averages for these measures
#     country_avg_data = df_merged_for_corr.groupby("country", observed=False)[cols_for_country_avg_corr].mean()
    
#     # Compute the correlation matrix
#     comprehensive_corr_matrix = country_avg_data.corr()

#     print("\nComprehensive Correlation Matrix (Country Averages):")
#     print(comprehensive_corr_matrix)

#     # Visualize the correlation matrix
#     fig_comp_corr = px.imshow(
#         comprehensive_corr_matrix,
#         labels=dict(color="Correlation"),
#         color_continuous_scale="RdBu_r",
#         zmin=-1,
#         zmax=1,
#         title="Correlation Matrix: Wallet Response, PISA, Survey & SC Measures (Country Averages)",
#         text_auto=True # Show correlation values on heatmap
#     )
#     fig_comp_corr.update_layout(height=800)
#     fig_comp_corr.show()
# else:
#     print("Skipping comprehensive correlation matrix as main df or PISA df is empty.")

Observations from the Correlation Matrix:

  1. PISA scores (PISA_Avg_Reading_2022) generally show stronger correlations with wallet response_numeric than many individual survey-based social capital measures do.

  2. PISA scores also correlate with other macro social capital indicators (log_gdp, gee, etc.), sometimes more strongly than the wallet response rate itself does. This might suggest PISA scores capture a broad aspect of societal development and functional institutions, which overlaps with social capital.

  3. The survey measures (like general_trust, general_morality) show varied correlations with both wallet response and PISA scores, reinforcing the idea that these are related but distinct constructs.
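Since the cell above is commented out, here is a minimal self-contained sketch of the country-average correlation idea on made-up data (countries, responses, and trust values are all hypothetical): average the individual-level variables within each country, then correlate the country means.

```python
# Country-average correlation on toy data.
import pandas as pd

toy = pd.DataFrame({
    "country":       ["A", "A", "B", "B"],
    "response":      [100, 0, 100, 100],      # wallet reported (100) or not (0)
    "general_trust": [0.2, 0.4, 0.8, 0.6],
})

# Collapse individuals to one row per country, then correlate the means
country_avg = toy.groupby("country")[["response", "general_trust"]].mean()
print(country_avg.corr())
```

With real data this is done over ~40 countries and the full set of survey, PISA, and social capital columns.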

3.5. Regression Analysis: Predictive Value of Wallet Return Rate for PISA Scores (from wallet_eda.py)#

Investigating whether wallet return rates, combined with other survey measures, significantly predict PISA scores. This explores Tannenbaum’s second point about predictive power, but with PISA as the outcome.

if not reading_wallets.empty and not df.empty:
    print("\nRegression Analysis: Predicting PISA Scores")
    print("Outcome variable: PISA_Avg_Reading_2022")
    print("Predictors: wallet_return_rate and one survey_measure at a time\n")
    
    # Ensure reading_wallets has 'country' as a column for merging, not index
    reading_wallets_reg = reading_wallets.copy()
    if reading_wallets_reg.index.name == 'country':
        reading_wallets_reg = reading_wallets_reg.reset_index()

    for measure in numeric_features_initial: # Using the initial numeric features as survey_measures
        # Get country-level average for the current survey measure
        country_avg_measure = df.groupby('country', observed=False)[[measure]].mean().reset_index()
        
        # Merge PISA/wallet data with the current survey measure average
        regression_df = pd.merge(reading_wallets_reg, country_avg_measure, on='country', how='inner')
        regression_df = regression_df.dropna(subset=['PISA_Avg_Reading_2022', 'wallet_return_rate', measure])

        if len(regression_df) < 3: # Need enough data points for regression
            print(f"Skipping regression for {measure} due to insufficient data after merge/dropna ({len(regression_df)} rows).")
            continue

        y = regression_df["PISA_Avg_Reading_2022"]
        X = regression_df[[measure, "wallet_return_rate"]]
        
        # Standardize predictors so their coefficients are comparable across survey measures (OLS itself doesn't require it)
        X_std = (X - X.mean()) / X.std()
        X_std = sm.add_constant(X_std) # Add intercept

        try:
            model = sm.OLS(y, X_std)
            results = model.fit(cov_type="HC1") # Robust standard errors
            print(f"\n--- Regression Summary for PISA ~ {measure} + wallet_return_rate ---")
            print(results.summary())
        except Exception as e:
            print(f"Error during regression for {measure}: {e}")
            print(f"Data for {measure}:\n y_head: {y.head()}\n X_std_head:\n{X_std.head()}")
else:
    print("Skipping regression analysis as reading_wallets or main df is empty.")
Regression Analysis: Predicting PISA Scores
Outcome variable: PISA_Avg_Reading_2022
Predictors: wallet_return_rate and one survey_measure at a time


--- Regression Summary for PISA ~ general_trust + wallet_return_rate ---
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     PISA_Avg_Reading_2022   R-squared:                       0.604
Model:                               OLS   Adj. R-squared:                  0.577
Method:                    Least Squares   F-statistic:                     26.93
Date:                   Sat, 31 May 2025   Prob (F-statistic):           2.01e-07
Time:                           16:42:08   Log-Likelihood:                -157.19
No. Observations:                     33   AIC:                             320.4
Df Residuals:                         30   BIC:                             324.9
Df Model:                              2                                         
Covariance Type:                     HC1                                         
======================================================================================
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                451.7902      5.175     87.309      0.000     441.648     461.932
general_trust         -1.5422      6.235     -0.247      0.805     -13.763      10.679
wallet_return_rate    36.5986      6.682      5.477      0.000      23.502      49.695
==============================================================================
Omnibus:                        2.482   Durbin-Watson:                   1.603
Prob(Omnibus):                  0.289   Jarque-Bera (JB):                1.355
Skew:                          -0.137   Prob(JB):                        0.508
Kurtosis:                       2.046   Cond. No.                         2.38
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC1)

--- Regression Summary for PISA ~ GPS_trust + wallet_return_rate ---
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     PISA_Avg_Reading_2022   R-squared:                       0.569
Model:                               OLS   Adj. R-squared:                  0.537
Method:                    Least Squares   F-statistic:                     18.72
Date:                   Sat, 31 May 2025   Prob (F-statistic):           7.94e-06
Time:                           16:42:08   Log-Likelihood:                -143.82
No. Observations:                     30   AIC:                             293.6
Df Residuals:                         27   BIC:                             297.8
Df Model:                              2                                         
Covariance Type:                     HC1                                         
======================================================================================
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                449.0720      5.625     79.839      0.000     438.048     460.096
GPS_trust              2.2726      4.907      0.463      0.643      -7.346      11.891
wallet_return_rate    34.0853      5.574      6.115      0.000      23.161      45.010
==============================================================================
Omnibus:                        3.577   Durbin-Watson:                   1.677
Prob(Omnibus):                  0.167   Jarque-Bera (JB):                1.820
Skew:                          -0.300   Prob(JB):                        0.402
Kurtosis:                       1.953   Cond. No.                         1.02
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC1)

--- Regression Summary for PISA ~ general_morality + wallet_return_rate ---
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     PISA_Avg_Reading_2022   R-squared:                       0.649
Model:                               OLS   Adj. R-squared:                  0.625
Method:                    Least Squares   F-statistic:                     27.12
Date:                   Sat, 31 May 2025   Prob (F-statistic):           1.88e-07
Time:                           16:42:08   Log-Likelihood:                -155.21
No. Observations:                     33   AIC:                             316.4
Df Residuals:                         30   BIC:                             320.9
Df Model:                              2                                         
Covariance Type:                     HC1                                         
======================================================================================
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                451.7902      4.873     92.713      0.000     442.239     461.341
general_morality      12.2621      4.745      2.584      0.010       2.961      21.563
wallet_return_rate    28.0774      5.309      5.289      0.000      17.672      38.482
==============================================================================
Omnibus:                        0.676   Durbin-Watson:                   1.755
Prob(Omnibus):                  0.713   Jarque-Bera (JB):                0.702
Skew:                          -0.087   Prob(JB):                        0.704
Kurtosis:                       2.307   Cond. No.                         2.02
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC1)

--- Regression Summary for PISA ~ MFQ_genmorality + wallet_return_rate ---
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     PISA_Avg_Reading_2022   R-squared:                       0.556
Model:                               OLS   Adj. R-squared:                  0.523
Method:                    Least Squares   F-statistic:                     24.32
Date:                   Sat, 31 May 2025   Prob (F-statistic):           9.14e-07
Time:                           16:42:08   Log-Likelihood:                -141.27
No. Observations:                     30   AIC:                             288.5
Df Residuals:                         27   BIC:                             292.7
Df Model:                              2                                         
Covariance Type:                     HC1                                         
======================================================================================
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                458.1024      5.166     88.681      0.000     447.978     468.227
MFQ_genmorality        6.1031      6.144      0.993      0.321      -5.939      18.145
wallet_return_rate    27.8549      4.136      6.734      0.000      19.748      35.962
==============================================================================
Omnibus:                        0.619   Durbin-Watson:                   1.583
Prob(Omnibus):                  0.734   Jarque-Bera (JB):                0.699
Skew:                          -0.178   Prob(JB):                        0.705
Kurtosis:                       2.342   Cond. No.                         1.44
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC1)

--- Regression Summary for PISA ~ civic_cooperation + wallet_return_rate ---
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     PISA_Avg_Reading_2022   R-squared:                       0.612
Model:                               OLS   Adj. R-squared:                  0.585
Method:                    Least Squares   F-statistic:                     33.78
Date:                   Sat, 31 May 2025   Prob (F-statistic):           2.66e-08
Time:                           16:42:08   Log-Likelihood:                -152.46
No. Observations:                     32   AIC:                             310.9
Df Residuals:                         29   BIC:                             315.3
Df Model:                              2                                         
Covariance Type:                     HC1                                         
======================================================================================
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                451.1013      5.269     85.609      0.000     440.774     461.429
civic_cooperation      1.3957      6.486      0.215      0.830     -11.316      14.107
wallet_return_rate    35.5125      7.125      4.984      0.000      21.548      49.478
==============================================================================
Omnibus:                        2.251   Durbin-Watson:                   1.748
Prob(Omnibus):                  0.325   Jarque-Bera (JB):                1.304
Skew:                          -0.159   Prob(JB):                        0.521
Kurtosis:                       2.063   Cond. No.                         1.70
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC1)

--- Regression Summary for PISA ~ GPS_posrecip + wallet_return_rate ---
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     PISA_Avg_Reading_2022   R-squared:                       0.598
Model:                               OLS   Adj. R-squared:                  0.568
Method:                    Least Squares   F-statistic:                     24.82
Date:                   Sat, 31 May 2025   Prob (F-statistic):           7.64e-07
Time:                           16:42:08   Log-Likelihood:                -142.78
No. Observations:                     30   AIC:                             291.6
Df Residuals:                         27   BIC:                             295.8
Df Model:                              2                                         
Covariance Type:                     HC1                                         
======================================================================================
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                449.0720      5.433     82.655      0.000     438.423     459.721
GPS_posrecip          -8.0259      4.487     -1.789      0.074     -16.820       0.768
wallet_return_rate    33.8260      4.804      7.041      0.000      24.410      43.242
==============================================================================
Omnibus:                        2.471   Durbin-Watson:                   1.801
Prob(Omnibus):                  0.291   Jarque-Bera (JB):                1.255
Skew:                          -0.006   Prob(JB):                        0.534
Kurtosis:                       1.998   Cond. No.                         1.03
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC1)

--- Regression Summary for PISA ~ GPS_altruism + wallet_return_rate ---
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     PISA_Avg_Reading_2022   R-squared:                       0.567
Model:                               OLS   Adj. R-squared:                  0.534
Method:                    Least Squares   F-statistic:                     20.42
Date:                   Sat, 31 May 2025   Prob (F-statistic):           3.97e-06
Time:                           16:42:08   Log-Likelihood:                -143.90
No. Observations:                     30   AIC:                             293.8
Df Residuals:                         27   BIC:                             298.0
Df Model:                              2                                         
Covariance Type:                     HC1                                         
======================================================================================
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                449.0720      5.640     79.627      0.000     438.018     460.126
GPS_altruism          -0.6875      5.752     -0.120      0.905     -11.962      10.587
wallet_return_rate    33.9253      5.325      6.371      0.000      23.488      44.362
==============================================================================
Omnibus:                        3.353   Durbin-Watson:                   1.686
Prob(Omnibus):                  0.187   Jarque-Bera (JB):                1.643
Skew:                          -0.232   Prob(JB):                        0.440
Kurtosis:                       1.952   Cond. No.                         1.24
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC1)

--- Regression Summary for PISA ~ stranger1 + wallet_return_rate ---
                              OLS Regression Results                             
=================================================================================
Dep. Variable:     PISA_Avg_Reading_2022   R-squared:                       0.610
Model:                               OLS   Adj. R-squared:                  0.584
Method:                    Least Squares   F-statistic:                     25.18
Date:                   Sat, 31 May 2025   Prob (F-statistic):           3.81e-07
Time:                           16:42:08   Log-Likelihood:                -156.87
No. Observations:                     33   AIC:                             319.7
Df Residuals:                         30   BIC:                             324.2
Df Model:                              2                                         
Covariance Type:                     HC1                                         
======================================================================================
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                449.6310      5.125     87.733      0.000     439.586     459.676
stranger1              6.1500      6.026      1.021      0.307      -5.661      17.961
wallet_return_rate    31.0808      5.934      5.238      0.000      19.451      42.710
==============================================================================
Omnibus:                        2.488   Durbin-Watson:                   1.707
Prob(Omnibus):                  0.288   Jarque-Bera (JB):                1.429
Skew:                          -0.198   Prob(JB):                        0.489
Kurtosis:                       2.060   Cond. No.                         2.35
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC1)

Commentary on Regression (from wallet_pisa.ipynb - Tannenbaum Point 2 Context): The Tannenbaum paper discusses four measures of “Economic and Institutional Performance”: GDP per capita (log_gdp), productivity (log_tfp), government effectiveness (gee), and letter grade efficiency (letter_grading). If PISA scores are considered a fifth measure of this type, the analysis above suggests that wallet reporting rates can be an effective predictor.

When wallet reporting rates are combined with other survey measures of social capital (as in the regressions above), the coefficient for wallet reporting rates often remains statistically significant. This implies that the behavioral measure (wallet returns) offers predictive power for educational outcomes (PISA scores) beyond what is captured by attitudinal survey measures alone. The R-squared values from these models can be compared to those in Tannenbaum’s paper for other institutional outcomes.

4. Machine Learning: Predicting Wallet Response#

Building models to predict whether a wallet is reported based on various features. We’ll first use the original features, then incorporate PISA scores to see whether they improve predictive accuracy (at the individual prediction level, which is a different task from country-level prediction/correlation).

4.1. Preprocessing and Model without PISA Data#

# Imports required for this section (not loaded in the setup cell above)
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

if not df.empty:
    # Define features for this model
    numeric_features_ml1 = [
        "general_trust", "GPS_trust", "general_morality", "MFQ_genmorality",
        "civic_cooperation", "GPS_posrecip", "GPS_altruism", "stranger1"
    ]
    categorical_features_ml1 = [
        "male", "above40", "computer", "coworkers", "other_bystanders", 
        "institution", "cond"
        # Excluding other categorical features like security_cam, etc. as in original script for this part
    ]

    # Numeric feature pipeline
    numeric_transformer = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
    )

    # Categorical feature pipeline
    categorical_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")), # Added imputer for categorical
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
            # SelectPercentile for categorical features after OHE might be less common than for numeric.
            # Chi2 requires non-negative features, which OHE provides.
            ("selector", SelectPercentile(chi2, percentile=50)) 
        ]
    )

    preprocessor1 = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numeric_features_ml1),
            ("cat", categorical_transformer, categorical_features_ml1),
        ]
    )

    # Target variable: needs to be numeric for scikit-learn classifiers
    # Assuming 'response' categories are '0' and '100'. We map them to 0 and 1.
    y = df["response"].cat.codes # This will map '0' to 0 and '100' to 1 if these are the categories
    X = df.drop(columns="response")

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=23, stratify=y # Stratify by y for classification
    )

    # Example classifier: SVC (as in original script)
    # The original script tested various classifiers; SVC with class_weight='balanced' was highlighted.
    clf1 = Pipeline(
        steps=[
            ("preprocessor", preprocessor1),
            ("classifier", SVC(random_state=123, class_weight="balanced"))
        ]
    )

    # Weighting observations (as explored in original script)
    # This was an experiment; its effectiveness can vary.
    cond_counts = X_train["cond"].value_counts(normalize=True)
    weights = X_train["cond"].map(lambda x: 1 / cond_counts[x] if x in cond_counts else 1.0)

    try:
        clf1.fit(X_train, y_train, classifier__sample_weight=weights)
        score1 = clf1.score(X_test, y_test)
        print(f"Model 1 (without PISA) score: {score1:.3f}")
    except Exception as e:
        print(f"Error fitting Model 1: {e}")
        print("Check feature lists and preprocessor compatibility with selected classifier.")

else:
    print("DataFrame is empty, skipping ML model training.")
Model 1 (without PISA) score: 0.680

4.2. Preprocessing and Model with PISA Data (and other SC measures)#

Here, we add country-level PISA scores and other social capital measures (sc_measures) to the feature set. Note that PISA scores are country-level, so they will be constant for all individuals within the same country. This might limit their predictive power for individual responses but is included as per the original exploration.

from sklearn.ensemble import RandomForestClassifier  # not loaded in the setup cell above

if not df.empty and not pisa_reading_2022.empty:
    df_with_pisa = df.merge(pisa_reading_2022[['country', 'PISA_Avg_Reading_2022']], how="left", on="country")
    # Impute PISA scores if some countries in df don't have them (e.g., with mean PISA score)
    df_with_pisa['PISA_Avg_Reading_2022'] = df_with_pisa['PISA_Avg_Reading_2022'].fillna(df_with_pisa['PISA_Avg_Reading_2022'].mean())

    numeric_features_ml2 = numeric_features_initial + sc_measures + ['PISA_Avg_Reading_2022']
    # Ensure all numeric features actually exist in df_with_pisa and are numeric
    valid_numeric_features_ml2 = []
    for col in numeric_features_ml2:
        if col in df_with_pisa.columns and pd.api.types.is_numeric_dtype(df_with_pisa[col]):
            valid_numeric_features_ml2.append(col)
        elif col in df_with_pisa.columns: # If column exists but not numeric, try to convert
            df_with_pisa[col] = pd.to_numeric(df_with_pisa[col], errors='coerce')
            if pd.api.types.is_numeric_dtype(df_with_pisa[col]):
                 valid_numeric_features_ml2.append(col)
            else:
                print(f"Warning: Column '{col}' could not be converted to numeric and will be excluded from numeric features for ML.")
        else:
            print(f"Warning: Column '{col}' not found in dataframe and will be excluded from numeric features for ML.")
    numeric_features_ml2 = valid_numeric_features_ml2
    
    # Categorical features from original script for this model
    categorical_features_ml2 = [
        "male", "above40", "computer", "coworkers", "other_bystanders", 
        "institution", "cond", "security_cam", "security_guard", 
        "local_recipient", "no_english", "understood_situation"
    ]
    valid_categorical_features_ml2 = [col for col in categorical_features_ml2 if col in df_with_pisa.columns]
    categorical_features_ml2 = valid_categorical_features_ml2

    # Numeric transformer (can reuse the previous one if strategy is the same)
    numeric_transformer_ml2 = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
    )

    # Categorical transformer (can reuse)
    categorical_transformer_ml2 = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
            ("selector", SelectPercentile(chi2, percentile=50)), 
        ]
    )

    preprocessor2 = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer_ml2, numeric_features_ml2),
            ("cat", categorical_transformer_ml2, categorical_features_ml2),
        ],
        remainder='drop' # Drop any columns not specified
    )

    # Target variable
    y2 = df_with_pisa["response"].cat.codes
    X2 = df_with_pisa.drop(columns="response")

    X_train2, X_test2, y_train2, y_test2 = train_test_split(
        X2, y2, test_size=0.2, random_state=23, stratify=y2
    )

    # Example classifier: RandomForestClassifier (as explored in original script)
    # SVC is left commented out below so Model 1's classifier can be swapped back in for comparison
    clf2 = Pipeline(
        steps=[
            ("preprocessor", preprocessor2),
            # ("classifier", SVC(random_state=123, class_weight="balanced")) 
            ("classifier", RandomForestClassifier(max_depth=8, random_state=0, class_weight="balanced"))
        ]
    )
    
    # Weighting (optional, as before)
    # cond_counts2 = X_train2["cond"].value_counts(normalize=True)
    # weights2 = X_train2["cond"].map(lambda x: 1 / cond_counts2[x] if x in cond_counts2 else 1.0)

    try:
        # clf2.fit(X_train2, y_train2, classifier__sample_weight=weights2) # If using weights
        if not X_train2.empty and len(numeric_features_ml2) + len(categorical_features_ml2) > 0:
            clf2.fit(X_train2, y_train2)
            score2 = clf2.score(X_test2, y_test2)
            print(f"Model 2 (with PISA & SC measures) score: {score2:.3f}")
        else:
            print("Skipping Model 2 fitting due to no valid features or empty training data.")
    except Exception as e:
        print(f"Error fitting Model 2: {e}")
        print(f"Numeric features used: {numeric_features_ml2}")
        print(f"Categorical features used: {categorical_features_ml2}")
        print("Check feature lists, data types, and preprocessor compatibility.")
else:
    print("DataFrame or PISA data is empty, skipping ML model training with PISA features.")
Model 2 (with PISA & SC measures) score: 0.681

4.3. Commentary on Machine Learning Results (from wallet_pisa.ipynb)#

Even with a comprehensive set of variables (experiment variables, survey measures, economic/institutional performance measures, PISA scores), predictive accuracy for individual wallet responses did not dramatically improve, typically hovering around 0.68 - 0.70 (depending on the model and specific run; the original script mentioned ~0.687).

This level of accuracy is somewhat expected. Predicting individual acts of honesty with high precision based on available (mostly demographic or country-level) data is inherently challenging. Individual behavior is complex and influenced by many immediate situational factors not captured in these datasets.
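To put an accuracy of ~0.68 in context, it helps to compare against a majority-class baseline. The sketch below uses synthetic labels; the 60/40 class balance is an assumption for illustration, not the wallet dataset's actual split.

```python
# Sketch: a majority-class baseline puts the ~0.68 accuracy in context.
# Synthetic binary labels stand in for the 'response' column; the 60/40
# class balance is an assumption, not the dataset's actual split.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(23)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.6).astype(int)  # assumed ~60/40 class balance

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=23, stratify=y
)
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
score = baseline.score(X_te, y_te)
print(f"Majority-class baseline accuracy: {score:.3f}")
```

If the real data's majority class is close to the model's accuracy, the features are adding only a modest amount of individual-level signal.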

5. Summary and Additional Findings/Thoughts#

This integrated analysis confirms several key points and reveals interesting connections:

  1. Wallet Return Rates as a Measure of Social Capital: The EDA and correlations support the idea that wallet return rates are a valid behavioral measure related to social capital, aligning with survey-based measures.

  2. PISA Scores and Social Capital: PISA educational achievement scores show a surprisingly strong positive correlation with country-level wallet return rates. This suggests that factors contributing to higher educational outcomes may overlap significantly with those fostering civic honesty and broader social capital. The correlation (around 0.78) is notable.

  3. Predictive Power:

    • At the country level, wallet return rates (and PISA scores) show predictive power for various economic and institutional indicators (as explored by Tannenbaum and extended here with PISA in regressions).

    • At the individual level, predicting whether a specific person will return a wallet based on the available features (even including PISA scores as a country-level proxy) remains difficult, with model accuracies around 68-70%. This highlights the difference between aggregate societal trends and individual behavior prediction.

  4. Feature Importance (Implicit): Although feature importance plots were not produced for the ML models in this combined notebook, the varying correlations in the association matrix and the regression coefficients hint at which factors are more strongly related to wallet returns and PISA scores. For instance, general_morality and general_trust often show some of the higher correlations among survey measures.
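The country-level correlation cited in point 2 can be reproduced with a one-line Pearson check once the merged frame is in hand. The sketch below runs on synthetic stand-in data, so the printed r is illustrative, not the ~0.78 from the real data.

```python
# Sketch: country-level Pearson correlation between wallet return rates
# and PISA reading scores, with a p-value. Synthetic data stands in for
# the merged country-level frame used earlier in the notebook.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n_countries = 33
wallet_rate = rng.uniform(0.2, 0.8, n_countries)
pisa = 400 + 150 * wallet_rate + rng.normal(0, 15, n_countries)

r, p = pearsonr(wallet_rate, pisa)
print(f"Pearson r = {r:.2f} (p = {p:.2g})")
```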

Further Exploration Ideas:

  • Interaction Effects: Explore interaction terms in regression models (e.g., does the relationship between a survey measure and PISA scores change depending on the level of wallet return rates in a country?).

  • Clustering: Could countries be clustered based on their profiles of wallet returns, PISA scores, and survey measures? This might reveal different typologies of social capital and development.

  • Longitudinal Analysis: If more time-series data were available for wallet experiments and PISA scores, one could explore how changes in these measures correlate over time within countries.

  • Qualitative Insights: The quantitative data could be supplemented with qualitative case studies of countries that are outliers (e.g., high PISA but lower wallet returns, or vice versa) to understand contextual factors.

  • Advanced ML Feature Importance: For the ML models, using techniques like SHAP values or permutation importance could provide clearer insights into which features are most predictive of individual wallet returns, even if overall accuracy is modest.
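As a minimal sketch of the last idea: scikit-learn's permutation_importance measures how much held-out accuracy drops when each feature's values are shuffled. Synthetic data stands in here; with the real models one would pass the fitted Pipeline and X_test/y_test instead.

```python
# Sketch: permutation importance on a fitted classifier, as suggested above.
# Synthetic data with one informative and one noise feature stands in for
# the wallet dataset; with the real models, pass the fitted Pipeline and
# the held-out X_test / y_test instead.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "informative": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
y = (X["informative"] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(max_depth=8, random_state=0).fit(X_tr, y_tr)

result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for name, mean_drop in zip(X.columns, result.importances_mean):
    print(f"{name}: mean accuracy drop = {mean_drop:.3f}")
```

Features whose shuffling barely moves accuracy (like "noise" here) contribute little individual-level signal, which is one way to diagnose why the models plateau around 0.68.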